knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")


select * from book
Displaying records 1 - 10
book_id title isbn13 language_id num_pages publication_date publisher_id
1 The World’s First Love: Mary Mother of God 8987059752 2 276 1996-09-01 1010
2 The Illuminati 20049130001 1 352 2004-10-04 1967
3 The Servant Leader 23755004321 1 128 2003-03-11 1967
4 What Life Was Like in the Jewel in the Crown: British India AD 1600-1905 34406054602 1 168 1999-09-01 1978
5 Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs 49086007763 1 80 1983-12-29 416
6 Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II 73999140774 1 298 2000-04-01 96
7 William Goldman: Four Screenplays 73999254907 2 504 2000-05-01 95
8 The Season: A Candid Look at Broadway 73999768442 1 448 2004-07-01 1136
9 The Beatles Complete - Updated Edition 73999960822 1 303 1986-12-01 835
10 Working Effectively with Legacy Code 76092025986 1 464 2004-09-01 1591
select * from book_language
Displaying records 1 - 10
language_id language_code language_name
1 eng English
2 en-US United States English
3 fre French
4 spa Spanish
5 en-GB British English
6 mul Multiple Languages
7 grc Greek
8 enm Middle English
9 en-CA Canadian English
10 ger German

1 - Indexes & Views

This section and the following take inspiration but not much content, from the LinkedIn learning course “Intermediate SQL for Data Scientists” - https://www.linkedin.com/learning/intermediate-sql-for-data-scientists

knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")

Indexes

Indexes in SQLite are structures designed to improve the speed of data retrieval. They’re similar to indexes in a book, allowing for faster access to rows in a table based on the values of one or more columns.

Types of Indexes -

  1. Single-Column Index - Created on a single column.

  2. Multi-Column Index - Created on two or more columns.

  3. Unique Index - Ensures that all values in the indexed column(s) are unique.

  4. Primary Key Index - Automatically created when a table has a primary key constraint.

  5. Automatic Indexes - Sometimes SQLite creates indexes automatically to optimize query performance, particularly for joins.


CREATE INDEX Customer_IDs ON cust_order(customer_id); 

CREATE INDEX Customer_Orders ON cust_order(order_id, customer_id);

CREATE UNIQUE INDEX unique_countries ON country(country_name);

This SQL code above fails - A unique index does not discard non-unique values or automatically filter out duplicates. Instead, it enforces a constraint that prevents duplicates from being inserted into the table in the first place.


CREATE UNIQUE INDEX unique_status_ids ON order_status(status_id);

List all indexes associated with a table -

PRAGMA index_list(cust_order);
2 records
seq name unique origin partial
0 Customer_Orders 0 c 0
1 Customer_IDs 0 c 0

Get information about a specific index -


PRAGMA index_info(Customer_Orders);
2 records
seqno cid name
0 0 order_id
1 2 customer_id

Indexes can be dropped using the DROP INDEX statement -


DROP INDEX Customer_Orders;

Checking the index deletion has worked -

PRAGMA index_list(cust_order);
1 records
seq name unique origin partial
0 Customer_IDs 0 c 0

An example with a more complex and intricate query -


select * from book
Displaying records 1 - 10
book_id title isbn13 language_id num_pages publication_date publisher_id
1 The World’s First Love: Mary Mother of God 8987059752 2 276 1996-09-01 1010
2 The Illuminati 20049130001 1 352 2004-10-04 1967
3 The Servant Leader 23755004321 1 128 2003-03-11 1967
4 What Life Was Like in the Jewel in the Crown: British India AD 1600-1905 34406054602 1 168 1999-09-01 1978
5 Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs 49086007763 1 80 1983-12-29 416
6 Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II 73999140774 1 298 2000-04-01 96
7 William Goldman: Four Screenplays 73999254907 2 504 2000-05-01 95
8 The Season: A Candid Look at Broadway 73999768442 1 448 2004-07-01 1136
9 The Beatles Complete - Updated Edition 73999960822 1 303 1986-12-01 835
10 Working Effectively with Legacy Code 76092025986 1 464 2004-09-01 1591

select author_name from author where author_name == "Walter Scott"
1 records
author_name
Walter Scott

Creating an index of author names.


CREATE INDEX Author_names ON author(author_name);

Now our ‘server’ should use that index to optimise the performance of this query.


SELECT
    b.title,
    b.isbn13,
    b.num_pages,
    b.publication_date,
    a.author_name,
    ol.price
FROM book b
INNER JOIN book_author ba ON b.book_id = ba.book_id
INNER JOIN author a ON ba.author_id = a.author_id
INNER JOIN order_line ol ON b.book_id = ol.book_id
INNER JOIN cust_order co ON ol.order_id = co.order_id
WHERE a.author_name IN ('J.K. Rowling', 'Bill Bryson')
  AND strftime('%Y', b.publication_date) IN ('1997','1998', '1999','2000')
ORDER BY ol.price DESC;
8 records
title isbn13 num_pages publication_date author_name price
Harry Potter and the Goblet of Fire (Harry Potter #4) 9.780748e+12 636 2000-07-08 J.K. Rowling 15.04
Bill Bryson: The Complete Notes 9.780386e+12 544 2000-10-05 Bill Bryson 13.73
The Lost Continent: Travels in Small-town America 9.780553e+12 379 1999-01-02 Bill Bryson 12.07
Harry Potter and the Sorcerer’s Stone (Harry Potter #1) 9.780786e+12 424 1999-11-12 J.K. Rowling 9.99
The Lost Continent: Travels in Small-town America 9.780553e+12 379 1999-01-02 Bill Bryson 9.60
I’m a Stranger Here Myself: Notes on Returning to America After Twenty Years Away 9.780768e+12 304 2000-06-28 Bill Bryson 9.40
Bill Bryson: The Complete Notes 9.780386e+12 544 2000-10-05 Bill Bryson 2.71
Notes from a Small Island 9.780381e+12 324 1997-05-28 Bill Bryson 1.87

Views

Views in SQLite are virtual tables that provide a way to represent the results of a query as a table.

Reasons to use Views -

  1. Simplify Complex Queries - By encapsulating complex joins and calculations within a view, queries can be simpler and more understandable.

  2. Enhance Security - Views can restrict access to specific data by exposing only certain columns or rows to users.

  3. Provide Abstraction - Offer a layer of abstraction, allowing changes in the underlying database schema without affecting the end users.

  4. Improve Maintainability - Views centralise query logic, making the system easier to maintain and modify.

CREATE VIEW Bryson_Books AS
SELECT
    b.title,
    b.isbn13,
    b.num_pages,
    b.publication_date,
    a.author_name,
    ol.price
FROM book b
INNER JOIN book_author ba ON b.book_id = ba.book_id
INNER JOIN author a ON ba.author_id = a.author_id
INNER JOIN order_line ol ON b.book_id = ol.book_id
INNER JOIN cust_order co ON ol.order_id = co.order_id
WHERE a.author_name IN ('Bill Bryson');
CREATE VIEW Rowling_Books AS
SELECT
    b.title,
    b.isbn13,
    b.num_pages,
    b.publication_date,
    a.author_name,
    ol.price
FROM book b
INNER JOIN book_author ba ON b.book_id = ba.book_id
INNER JOIN author a ON ba.author_id = a.author_id
INNER JOIN order_line ol ON b.book_id = ol.book_id
INNER JOIN cust_order co ON ol.order_id = co.order_id
WHERE a.author_name IN ('J.K. Rowling');
CREATE VIEW Walter_Scott_Books AS
SELECT
    b.title,
    b.isbn13,
    b.num_pages,
    b.publication_date,
    a.author_name,
    ol.price
FROM book b
INNER JOIN book_author ba ON b.book_id = ba.book_id
INNER JOIN author a ON ba.author_id = a.author_id
INNER JOIN order_line ol ON b.book_id = ol.book_id
INNER JOIN cust_order co ON ol.order_id = co.order_id
WHERE a.author_name IN ('Walter Scott');
SELECT * FROM Bryson_Books;
Displaying records 1 - 10
title isbn13 num_pages publication_date author_name price
Bizarre World 9.780752e+12 120 2001-05-01 Bill Bryson 17.32
The Lost Continent: Travels in Small Town America 9.780061e+12 299 1990-08-28 Bill Bryson 14.36
Neither Here nor There: Travels in Europe 9.780381e+12 254 1993-03-28 Bill Bryson 12.47
Made in America: An Informal History of the English Language in the United States 9.780381e+12 364 2001-10-23 Bill Bryson 9.28
Notes from a Small Island 9.780381e+12 324 1997-05-28 Bill Bryson 1.87
Bill Bryson: The Complete Notes 9.780386e+12 544 2000-10-05 Bill Bryson 2.71
Bill Bryson: The Complete Notes 9.780386e+12 544 2000-10-05 Bill Bryson 13.73
A Short History of Nearly Everything (Illustrated Edition) 9.780386e+12 624 2010-10-05 Bill Bryson 5.64
A Short History of Nearly Everything (Illustrated Edition) 9.780386e+12 624 2010-10-05 Bill Bryson 19.05
Down Under 9.780553e+12 398 2001-08-06 Bill Bryson 10.92
SELECT * FROM Walter_Scott_Books;
6 records
title isbn13 num_pages publication_date author_name price
The Antiquary 9.780193e+12 528 2002-05-23 Walter Scott 10.66
The Antiquary 9.780193e+12 528 2002-05-23 Walter Scott 15.75
The Antiquary 9.780193e+12 528 2002-05-23 Walter Scott 0.84
The Antiquary 9.780193e+12 528 2002-05-23 Walter Scott 13.99
The Castle of Otranto 9.780193e+12 125 1998-07-16 Walter Scott 3.56
Waverley 9.780193e+12 463 1998-08-20 Walter Scott 1.96

Views themselves are not directly updatable, but they can be dropped and recreated -

DROP VIEW IF EXISTS Bryson_Books;
SELECT name FROM sqlite_master WHERE type='view';
2 records
name
Rowling_Books
Walter_Scott_Books
SELECT sql FROM sqlite_master WHERE type='view' AND name='Walter_Scott_Books';
1 records
sql
CREATE VIEW Walter_Scott_Books AS

SELECT b.title, b.isbn13, b.num_pages, b.publication_date, a.author_name, ol.price FROM book b INNER JOIN book_author ba ON b.book_id = ba.book_id INNER JOIN author a ON ba.author_id = a.author_id INNER JOIN order_line ol ON b.book_id = ol.book_id INNER JOIN cust_order co ON ol.order_id = co.order_id WHERE a.author_name IN (‘Walter Scott’) |


2 - Statistical aggregate functions

SUM(), AVG(), ROUND()

The SUM() function is an aggregate function that calculates the total sum of a numeric column. The function is commonly used in conjunction with the GROUP BY clause to calculate sums for specific groups of data.

knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")

SELECT SUM(street_name) as "No.different Street Names", SUM(city) as "No.different Cities"  FROM address;
1 records
No.different Street Names No.different Cities
87 0

As street_name and city are text columns, sum won’t work properly on them. It’s best to use COUNT(DISTINCT()).

SELECT COUNT(DISTINCT street_name) as "No. of Different Street Names", 
       COUNT(DISTINCT city) as "No. of Different Cities" 
FROM address;
1 records
No. of Different Street Names No. of Different Cities
958 974
knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")


SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
ORDER BY 
    "Order per Customer" DESC;
Displaying records 1 - 10
Sum of Orders per Customer Full Name
112.44 Abbot Tesseyman
31.84 Abbott Kendrew
119.68 Abby Chevins
59.10 Abel Trower
88.54 Abigael Trowbridge
177.89 Abraham Skudder
104.84 Adah Cotty
143.15 Addie Basterfield
47.75 Addison Sigg
161.51 Adrian Kunzelmann

The AVG() function in SQL is an aggregate function that calculates the average value of a numeric column. It sums up all the values in the column and divides by the number of non-null values, providing the mean value.


SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    AVG(ol.price) AS "Average Order Price per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
ORDER BY 
    "Order per Customer" DESC;
Displaying records 1 - 10
Sum of Orders per Customer Average Order Price per Customer Full Name
112.44 11.244000 Abbot Tesseyman
31.84 10.613333 Abbott Kendrew
119.68 13.297778 Abby Chevins
59.10 7.387500 Abel Trower
88.54 8.854000 Abigael Trowbridge
177.89 9.362632 Abraham Skudder
104.84 7.488571 Adah Cotty
143.15 10.225000 Addie Basterfield
47.75 9.550000 Addison Sigg
161.51 9.500588 Adrian Kunzelmann

The ROUND() function in SQL is used to round a numeric value to a specified number of decimal places. It takes two arguments: the number to be rounded and the number of decimal places to round to.


SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    ROUND(AVG(ol.price),2) AS "Average Order Price per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
ORDER BY 
    "Order per Customer" DESC;
Displaying records 1 - 10
Sum of Orders per Customer Average Order Price per Customer Full Name
112.44 11.24 Abbot Tesseyman
31.84 10.61 Abbott Kendrew
119.68 13.30 Abby Chevins
59.10 7.39 Abel Trower
88.54 8.85 Abigael Trowbridge
177.89 9.36 Abraham Skudder
104.84 7.49 Adah Cotty
143.15 10.22 Addie Basterfield
47.75 9.55 Addison Sigg
161.51 9.50 Adrian Kunzelmann

Variance

The VARIANCE() function is used to calculate the statistical variance of a set of numeric values. The variance defines a measure of the spread dispersion within a set of data.

However, SQLite does not have a built-in VARIANCE() function.

To calculate variance in SQLite, one can use a combination of SQL functions.

  • Calculating Sample Variance -

The sample variance estimates the variance from a sample of the population. The formula is -

\[ \sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]

\[\sigma^2\] is the population variance

\[x_i\] represents each data point

\[\bar{x}\] is the sample mean

\[n - 1\] is the total number of data points in the population

Used when there’s only a sample and need the population variance needs to be estimated. The denominator is \[n - 1\] . \[n - 1\] instead of \[n\] to correct the bias in the estimation of the population variance from a sample (Bessel’s correction).

The SQL code -

SELECT SUM((value - avg_value) * (value - avg_value)) / (COUNT(*) - 1) AS sample_variance FROM ( SELECT value, AVG(value) AS avg_value FROM table_name ) AS subquery;


WITH OrderStats AS (
    SELECT 
        c.customer_id,
        SUM(ol.price) AS "Sum of Orders per Customer",
        AVG(ol.price) AS "Average Order Price per Customer",
        COUNT(ol.price) AS "Order Count",
        CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
    FROM 
        order_line ol
    INNER JOIN 
        cust_order co ON ol.order_id = co.order_id
    INNER JOIN 
        customer c ON co.customer_id = c.customer_id
    GROUP BY 
        c.customer_id, c.first_name, c.last_name
)
SELECT
    os."Sum of Orders per Customer",
    ROUND(os."Average Order Price per Customer", 2) AS "Average Order Price per Customer",
    os."Full Name",
    ROUND(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / (os."Order Count" - 1), 2) AS "Order Price Sample Variance"
FROM
    order_line ol
INNER JOIN
    cust_order co ON ol.order_id = co.order_id
INNER JOIN
    customer c ON co.customer_id = c.customer_id
INNER JOIN
    OrderStats os ON c.customer_id = os.customer_id
GROUP BY
    os.customer_id, os."Sum of Orders per Customer", os."Average Order Price per Customer", os."Full Name"
ORDER BY
    "Sum of Orders per Customer" DESC;
Displaying records 1 - 10
Sum of Orders per Customer Average Order Price per Customer Full Name Order Price Sample Variance
711.41 10.94 Rich Kirsz 32.82
663.41 10.21 Farand Tremmil 34.46
635.58 10.42 Renado Sherrington 34.50
622.31 12.20 Lynda Roseborough 26.15
611.59 10.54 Mick Sever 38.09
596.18 11.25 Alysa Crombleholme 29.67
583.81 9.73 Emylee Hubbert 37.33
574.20 10.07 La verne Figg 30.94
561.26 9.85 Zora Hurles 42.98
549.63 10.78 Penny Bovingdon 37.10
  • Calculating Population Variance -

The population variance measures the dispersion of all data points in a population from the population mean \[\mu\]. The formula is -

\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \]

\[\sigma^2\] is the population variance

\[x_i\] represents each data point

\[\mu\] is the mean of the population

\[n\] is the total number of data points in the population

With large datasets, the difference between sample and population variance is minimal since the correction factor (n - 1) vs. (n) has a smaller impact as the sample size increases.

The SQL code -

SELECT AVG((value - avg_value) * (value - avg_value)) AS population_variance FROM ( SELECT value, AVG(value) AS avg_value FROM table_name ) AS subquery;


WITH OrderStats AS (
    SELECT 
        c.customer_id,
        SUM(ol.price) AS "Sum of Orders per Customer",
        AVG(ol.price) AS "Average Order Price per Customer",
        COUNT(ol.price) AS "Order Count",
        CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
    FROM 
        order_line ol
    INNER JOIN 
        cust_order co ON ol.order_id = co.order_id
    INNER JOIN 
        customer c ON co.customer_id = c.customer_id
    GROUP BY 
        c.customer_id, c.first_name, c.last_name
)
SELECT
    os."Sum of Orders per Customer",
    ROUND(os."Average Order Price per Customer", 2) AS "Average Order Price per Customer",
    os."Full Name",
    ROUND(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / os."Order Count", 2) AS "Order Price Population Variance"
FROM
    order_line ol
INNER JOIN
    cust_order co ON ol.order_id = co.order_id
INNER JOIN
    customer c ON co.customer_id = c.customer_id
INNER JOIN
    OrderStats os ON c.customer_id = os.customer_id
GROUP BY
    os.customer_id, os."Sum of Orders per Customer", os."Average Order Price per Customer", os."Full Name"
ORDER BY
    "Sum of Orders per Customer" DESC;
Displaying records 1 - 10
Sum of Orders per Customer Average Order Price per Customer Full Name Order Price Population Variance
711.41 10.94 Rich Kirsz 32.31
663.41 10.21 Farand Tremmil 33.93
635.58 10.42 Renado Sherrington 33.94
622.31 12.20 Lynda Roseborough 25.64
611.59 10.54 Mick Sever 37.44
596.18 11.25 Alysa Crombleholme 29.11
583.81 9.73 Emylee Hubbert 36.71
574.20 10.07 La verne Figg 30.40
561.26 9.85 Zora Hurles 42.23
549.63 10.78 Penny Bovingdon 36.37

Standard Deviation

Standard deviation measures the dispersion of a dataset relative to its mean, indicating how spread out the data points are. A low standard deviation means the data points are close to the mean, while a high standard deviation indicates they are more spread out.

SQLite does not have a built-in STDDEV() or STDEV() function for calculating standard deviation.

These are the steps to calculate standard deviation in SQLite:

  1. Calculate the mean of the dataset.

  2. Compute Squared Differences calculate the squared difference of each value from the mean.

  3. Aggregate and Calculate - Sum up the squared differences, divide by the count of values (for population standard deviation) or by (n - 1) (for sample standard deviation), and take the square root of the result.

The population standard deviation (()) is calculated using the formula:

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \]

where: - () is the population standard deviation, - (x_i) represents each data point, - () is the mean of the population, - (N) is the total number of data points in the population.

The SQL code -

SELECT SQRT(AVG((value - avg_value) * (value - avg_value))) AS population_stddev FROM ( SELECT value, AVG(value) OVER () AS avg_value FROM table_name ) AS subquery;

WITH OrderStats AS (
    SELECT 
        c.customer_id,
        SUM(ol.price) AS "Sum of Orders per Customer",
        AVG(ol.price) AS "Average Order Price per Customer",
        COUNT(ol.price) AS "Order Count",
        CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
    FROM 
        order_line ol
    INNER JOIN 
        cust_order co ON ol.order_id = co.order_id
    INNER JOIN 
        customer c ON co.customer_id = c.customer_id
    GROUP BY 
        c.customer_id, c.first_name, c.last_name
)
SELECT
    os."Sum of Orders per Customer",
    ROUND(os."Average Order Price per Customer", 2) AS "Average Order Price per Customer",
    os."Full Name",
    ROUND(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / os."Order Count", 2) AS "Order Price Population Variance",
    ROUND(SQRT(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / os."Order Count"), 2) AS "Order Price Standard Deviation"
FROM
    order_line ol
INNER JOIN
    cust_order co ON ol.order_id = co.order_id
INNER JOIN
    customer c ON co.customer_id = c.customer_id
INNER JOIN
    OrderStats os ON c.customer_id = os.customer_id
GROUP BY
    os.customer_id, os."Sum of Orders per Customer", os."Average Order Price per Customer", os."Full Name"
ORDER BY
    "Sum of Orders per Customer" DESC;
Displaying records 1 - 10
Sum of Orders per Customer Average Order Price per Customer Full Name Order Price Population Variance Order Price Standard Deviation
711.41 10.94 Rich Kirsz 32.31 5.68
663.41 10.21 Farand Tremmil 33.93 5.82
635.58 10.42 Renado Sherrington 33.94 5.83
622.31 12.20 Lynda Roseborough 25.64 5.06
611.59 10.54 Mick Sever 37.44 6.12
596.18 11.25 Alysa Crombleholme 29.11 5.39
583.81 9.73 Emylee Hubbert 36.71 6.06
574.20 10.07 La verne Figg 30.40 5.51
561.26 9.85 Zora Hurles 42.23 6.50
549.63 10.78 Penny Bovingdon 36.37 6.03

The sample standard deviation ((s)) is calculated using the formula:

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]

where: - (s) is the sample standard deviation, - (x_i) represents each data point in the sample, - ({x}) is the sample mean, - (n) is the number of data points in the sample.

The SQL code -

SELECT SQRT(SUM((value - avg_value) * (value - avg_value)) / (COUNT(*) - 1)) AS sample_stddev FROM ( SELECT value, AVG(value) OVER () AS avg_value FROM table_name ) AS subquery;


WITH SampledCustomers AS (
    SELECT 
        c.customer_id,
        c.first_name,
        c.last_name
    FROM 
        customer c
    ORDER BY 
        RANDOM()
    LIMIT 50
),
OrderStats AS (
    SELECT 
        c.customer_id,
        SUM(ol.price) AS "Sum of Orders per Customer",
        AVG(ol.price) AS "Average Order Price per Customer",
        COUNT(ol.price) AS "Order Count",
        CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
    FROM 
        order_line ol
    INNER JOIN 
        cust_order co ON ol.order_id = co.order_id
    INNER JOIN 
        SampledCustomers c ON co.customer_id = c.customer_id
    GROUP BY 
        c.customer_id, c.first_name, c.last_name
)
SELECT
    os."Sum of Orders per Customer",
    ROUND(os."Average Order Price per Customer", 2) AS "Average Order Price per Customer",
    os."Full Name",
    ROUND(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / (os."Order Count" - 1), 2) AS "Order Price Sample Variance",
    ROUND(SQRT(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / (os."Order Count" - 1)), 2) AS "Order Price Sample Standard Deviation"
FROM
    order_line ol
INNER JOIN
    cust_order co ON ol.order_id = co.order_id
INNER JOIN
    customer c ON co.customer_id = c.customer_id
INNER JOIN
    OrderStats os ON c.customer_id = os.customer_id
GROUP BY
    os.customer_id, os."Sum of Orders per Customer", os."Average Order Price per Customer", os."Full Name"
ORDER BY
    "Sum of Orders per Customer" DESC;
Displaying records 1 - 10
Sum of Orders per Customer Average Order Price per Customer Full Name Order Price Sample Variance Order Price Sample Standard Deviation
583.81 9.73 Emylee Hubbert 37.33 6.11
549.63 10.78 Penny Bovingdon 37.10 6.09
462.80 10.06 Murray Senussi 28.28 5.32
354.02 9.08 Shae Leser 33.24 5.77
264.18 10.16 Ronnie Thaim 27.34 5.23
247.00 11.23 Shayna Farbrother 32.07 5.66
213.33 9.70 Urbanus Renfield 33.70 5.81
197.43 10.39 Wain Michel 34.13 5.84
153.72 10.25 Amalita Digginson 58.57 7.65
126.77 7.92 Heinrik Sibylla 21.70 4.66

3 - WITH() function

WITH(), also known as Common Table Expressions (CTEs), allows for improved readability and reusability of SQL queries. It’s particularly useful for breaking down complex queries into simpler, more manageable parts by creating temporary result sets that can be referenced within the main query.

WITH cte_name AS ( – CTE Query SELECT … ) SELECT … FROM cte_name

Key Differences

  1. Structure and Readability -

Subqueries - Can be less readable, especially when nested.

WITH() - Provide a clearer, more structured approach by defining temporary tables with meaningful names.

  1. Reusability -

Subqueries - Generally not reusable; you need to repeat the subquery if it’s used in multiple places.

WITH() - Reusable within the main query, reducing redundancy and improving maintainability.

  1. Debugging and Maintenance -

Subqueries - Harder to debug and maintain due to their nested nature.

WITH() - Easier to debug and maintain due to their clear, modular structure.

When using WITH()


-- Step 1: Calculate the total amount spent by each customer
WITH CustomerSpending AS (
    SELECT 
        co.customer_id,
        CONCAT(c.first_name, ' ', c.last_name) AS full_name,
        SUM(ol.price) AS total_spent
    FROM 
        cust_order co
    INNER JOIN 
        order_line ol ON co.order_id = ol.order_id
    INNER JOIN 
        customer c ON co.customer_id = c.customer_id
    GROUP BY 
        co.customer_id, full_name
)

-- Step 2: Select the top 5 customers and join with book details
SELECT
    b.title,
    ol.price,
    cs.full_name,
    cs.total_spent
FROM 
    CustomerSpending cs
INNER JOIN 
    cust_order co ON cs.customer_id = co.customer_id
INNER JOIN 
    order_line ol ON co.order_id = ol.order_id
INNER JOIN 
    book b ON ol.book_id = b.book_id
WHERE 
    cs.customer_id IN (
        SELECT customer_id 
        FROM CustomerSpending 
        ORDER BY total_spent DESC 
        LIMIT 5
    )
ORDER BY 
    cs.total_spent DESC;
Displaying records 1 - 10
title price full_name total_spent
The Wall of the Sky the Wall of the Eye 1.15 Rich Kirsz 711.41
The Call of Earth (Homecoming #2) 11.20 Rich Kirsz 711.41
Prisoner’s Dilemma 7.33 Rich Kirsz 711.41
Of Love and Shadows 14.97 Rich Kirsz 711.41
Berlin Blues 9.92 Rich Kirsz 711.41
Triptych (Will Trent #1) 0.76 Rich Kirsz 711.41
If Beale Street Could Talk 11.44 Rich Kirsz 711.41
No Country for Old Men 8.26 Rich Kirsz 711.41
The Blood of the Moon: Understanding the Historic Struggle Between Islam and Western Civilization 13.00 Rich Kirsz 711.41
À tout jamais 14.56 Rich Kirsz 711.41

The part with the WITH inner query -

  • This CTE calculates the total amount spent on each order for every customer.

  • It joins the cust_order,order_line, and customers tables and groups the data by customer ID and full name to compute the total order amount.

The rest, the outer query -

  • The main query selects from the CustomerSpending CTE.

  • It also gets the book title and price from the book and order line tables respectively, by INNER JOINs.

  • A WHERE clause specifies the results should only contains the top 5 spending customers in CustomerSpending.

  • The results are ordered by the top spenders, descending.


4 - Further Data Manipulation in SQL

the ‘like’ operator

knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")

The LIKE operator is used for pattern matching within text fields. It allows the use of % to represent zero or more characters and _ to represent a single character. For example, the query SELECT * FROM customers WHERE name LIKE '%enko'; will return all rows where the name column ends with “enko”.

Getting all surnames starting with ‘Mac’ -

select * from customer where last_name like 'Mac%'
Displaying records 1 - 10
customer_id first_name last_name email
114 Aurelie MacSherry
211 Desiri MacDunleavy
299 Yvonne Maccree
416 Zaria MacCafferky
481 Kaleb MacClenan
500 Aurelia MacGowan
604 Agace MacKinnon
951 Wilma MacGorrie
979 Danell Mace
994 Archibold MacNab

Getting all surnames starting with ‘Mc’ -

select * from customer where last_name like 'Mc%'
Displaying records 1 - 10
customer_id first_name last_name email
73 Ruthi McGeever
168 Lynsey McPeice
187 Miner McLay
199 Bjorn McCloud
317 Stesha McAlees
319 Win McManamon
347 Ephrem Mc Ilwrick
410 Rriocard McPhail
539 Jane McCreath
772 Tybalt McOwen

Getting all surnames ending with ‘vich’ -

select * from customer where last_name like '%vich'
2 records
customer_id first_name last_name email
578 Lynea Matskevich
1545 Deana Matusevich

Getting all surnames ending with ‘enko’ -

select * from customer where last_name like '%enko'
2 records
customer_id first_name last_name email
174 Zachery Hriinchenko
1062 Harold Izacenko

Getting all surnames starting or ending with ‘man’ (lower or upper case) -

select * from customer where last_name like '%man%'
Displaying records 1 - 10
customer_id first_name last_name email
218 Abbot Tesseyman
221 Parker Strangman
319 Win McManamon
360 Dulci Portman
366 Giulia Borrowman
402 Charlean Palphramand
413 Sigfried Mansel
430 Adrian Kunzelmann
483 Antonie Liebmann
501 Salome Elliman

SIMILAR TO

Making a new column specifically based on the like condition -


SELECT 
    last_name as "Customer Surname", 
    CASE 
        WHEN last_name LIKE '%enko' 
            OR last_name LIKE '%vich' 
            OR last_name LIKE '%vych' 
            OR last_name LIKE '%chuk' 
            OR last_name LIKE '%chyk' 
            OR last_name LIKE '%ski' 
            OR last_name LIKE '%sky' 
            OR last_name LIKE '%uk' 
            OR last_name LIKE '%ko' 
            OR last_name LIKE '%yshyn' 
            OR last_name LIKE '%iv' THEN 'Yes' 
        ELSE 'No' 
    END AS 'Potentially Ukrainian Surname?' 
FROM customer 
WHERE last_name LIKE '%enko' 
    OR last_name LIKE '%vich' 
    OR last_name LIKE '%vych' 
    OR last_name LIKE '%chuk' 
    OR last_name LIKE '%chyk' 
    OR last_name LIKE '%ski' 
    OR last_name LIKE '%sky' 
    OR last_name LIKE '%uk' 
    OR last_name LIKE '%ko' 
    OR last_name LIKE '%yshyn' 
    OR last_name LIKE '%iv';
Displaying records 1 - 10
Customer Surname Potentially Ukrainian Surname?
Hriinchenko Yes
Davidofski Yes
Bagniuk Yes
Georgievski Yes
Matskevich Yes
Malinowski Yes
Matschuk Yes
Izacenko Yes
Dmiterko Yes
Matusevich Yes

Unfortunately, SQLite that’s being used here does not support the command ‘similar to’ that could be used in PostGreSQL.

If PostGreSQL was being used, the above query could be shortened a lot -

SELECT last_name, CASE WHEN last_name SIMILAR TO ‘%(enko|vich|vych|chuk|chyk|ski|sky|uk|ko|yshyn|iv)’ THEN ‘YES’ ELSE ‘NO’ END AS “Ukrainian Surname” FROM customer WHERE last_name SIMILAR TO ‘%(enko|vich|vych|chuk|chyk|ski|sky|uk|ko|yshyn|iv)’;

SOUNDEX

knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")

SOUNDEX is a phonetic algorithm that indexes words by their sound when pronounced in English. This can be useful for matching words that sound similar but are spelled differently. SQLite has this function.


SELECT SOUNDEX('George');
1 records
SOUNDEX(‘George’)
G620

SELECT last_name AS 'Surnames' 
FROM customer
WHERE SOUNDEX(last_name) = SOUNDEX('Smyth');
2 records
Surnames
Smoth
Sineath

select soundex('Postgres'), soundex('Postgresss'), ('Postgres' = 'Postgresss'),
soundex('Postgres') = soundex('Postgresss')
1 records
soundex(‘Postgres’) soundex(‘Postgresss’) (‘Postgres’ = ‘Postgresss’) soundex(‘Postgres’) = soundex(‘Postgresss’)
P232 P232 0 1

A SOUNDEX code consists of a letter followed by three digits, representing the phonetic pattern of the word. Here’s how it is constructed:

  • First Letter: The first letter of the word is kept as the first letter of the SOUNDEX code.

  • Digits: The remaining letters are converted to digits based on their phonetic sound:

B, F, P, V → 1

C, G, J, K, Q, S, X, Z → 2

D, T → 3

L → 4

M, N → 5

R → 6

  • Similar Sounds: Adjacent letters that represent the same sound are collapsed into a single digit.

  • Vowels and Certain Letters: A, E, I, O, U, H, W, and Y are ignored unless they are the first letter.

  • Truncation/Zero Padding: The code is truncated to four characters if necessary, or zero-padded to ensure it is four characters long.

  • For example, the SOUNDEX code “P232” is generated from the word “Postgres” as follows:

    ‘P’ is the first letter.

    ‘o’ is ignored.

    ‘s’ maps to 2.

    ‘t’ maps to 3.

    ‘g’ maps to 2.

    Remaining letters (‘r’, ‘e’, ‘s’) are either ignored or do not change the pattern as the code is already four characters long.

Thus, “Postgres” becomes “P232”.


select difference ('Postgres', 'Postgresss') as "Difference between the strings Postgres and Postgresss"
1 records
Difference between the strings Postgres and Postgresss
4

The DIFFERENCE() function in SQL compares the SOUNDEX values of two strings and returns an integer value between 0 and 4, indicating the degree of similarity between the two strings. A result of 4 means the strings sound very similar, while a result of 0 means they sound very different. This function is particularly useful for fuzzy matching in text searches.

The levenshtein() function calculates the Levenshtein distance between two strings, which is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. It is commonly used to measure the similarity between two strings, with a lower distance indicating greater similarity.

SQLite doesn’t have it unfortunately!

The HAVING Clause

SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
ORDER BY 
    "Sum of Orders per Customer" DESC;
Displaying records 1 - 10
Sum of Orders per Customer Full Name
711.41 Rich Kirsz
663.41 Farand Tremmil
635.58 Renado Sherrington
622.31 Lynda Roseborough
611.59 Mick Sever
596.18 Alysa Crombleholme
583.81 Emylee Hubbert
574.20 La verne Figg
561.26 Zora Hurles
549.63 Penny Bovingdon

The HAVING clause is used to specify a condition for groups of rows created by the GROUP BY clause, similar to how the WHERE clause is used to specify a condition for individual rows.

SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
HAVING
    "Sum of Orders per Customer" > 600
ORDER BY 
    "Sum of Orders per Customer" DESC;
5 records
Sum of Orders per Customer Full Name
711.41 Rich Kirsz
663.41 Farand Tremmil
635.58 Renado Sherrington
622.31 Lynda Roseborough
611.59 Mick Sever

5 - Window Functions

knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")

Window functions in SQL perform calculations across a set of table rows that are somehow related to the current row. Unlike aggregate functions, which return a single value for a group of rows, window functions can return multiple rows for each row in the query, preserving the row’s identity. These functions are often used for running totals, moving averages, and other cumulative calculations.

ROW_NUMBER()

ROW_NUMBER() - Assigns a unique sequential integer to rows within a partition of a result set.

Assigns a unique sequential integer to each row, ordered by “Sum of Orders per Customer” in descending order, useful for ranking albeit with ties (there can’t be 2 third placed rows for example).

SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
    ROW_NUMBER() OVER (ORDER BY SUM(ol.price) DESC) AS "Row Number"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
ORDER BY 
    "Sum of Orders per Customer" DESC;
Displaying records 1 - 10
Sum of Orders per Customer Full Name Row Number
711.41 Rich Kirsz 1
663.41 Farand Tremmil 2
635.58 Renado Sherrington 3
622.31 Lynda Roseborough 4
611.59 Mick Sever 5
596.18 Alysa Crombleholme 6
583.81 Emylee Hubbert 7
574.20 La verne Figg 8
561.26 Zora Hurles 9
549.63 Penny Bovingdon 10

RANK()

Assigns a rank to each row, with gaps in the ranking sequence for ties.

Assigns a rank to each customer based on their “Sum of Orders per Customer”, allowing for ties where customers with the same sum receive the same rank. No 89th position just two 88 positions.

WITH RankedCustomers AS (
    SELECT 
        SUM(ol.price) AS "Sum of Orders per Customer",
        CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
        RANK() OVER (ORDER BY SUM(ol.price) DESC) AS "Rank"
    FROM 
        order_line ol
    INNER JOIN 
        cust_order co ON ol.order_id = co.order_id
    INNER JOIN 
        customer c ON co.customer_id = c.customer_id
    GROUP BY 
        c.first_name, c.last_name
)
SELECT *
FROM RankedCustomers
WHERE "Rank" BETWEEN 85 AND 95
ORDER BY "Rank";
Displaying records 1 - 10
Sum of Orders per Customer Full Name Rank
337.07 Krishnah Traite 85
336.50 Cecily Norley 86
336.23 Emory Bagniuk 87
334.23 Francis Cutmore 88
334.23 Ruthanne Vatini 88
331.91 Leone Utterson 90
330.95 Nicolina Scorthorne 91
329.13 Vladamir Ellerey 92
328.39 Berti Gowlett 93
328.25 Daisey Lamball 94

DENSE_RANK()

DENSE_RANK() does not leave gaps and provides consecutive ranking numbers regardless of the number of ties whereas RANK() leaves gaps in the ranking sequence if there are ties. An 89th position as well as two 88 positions.


WITH RankedCustomers AS (
    SELECT 
        SUM(ol.price) AS "Sum of Orders per Customer",
        CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
        DENSE_RANK() OVER (ORDER BY SUM(ol.price) DESC) AS "Dense Rank"
    FROM 
        order_line ol
    INNER JOIN 
        cust_order co ON ol.order_id = co.order_id
    INNER JOIN 
        customer c ON co.customer_id = c.customer_id
    GROUP BY 
        c.first_name, c.last_name
)
SELECT *
FROM RankedCustomers
WHERE "Dense Rank" BETWEEN 85 AND 95
ORDER BY "Dense Rank";
Displaying records 1 - 10
Sum of Orders per Customer Full Name Dense Rank
337.07 Krishnah Traite 85
336.50 Cecily Norley 86
336.23 Emory Bagniuk 87
334.23 Francis Cutmore 88
334.23 Ruthanne Vatini 88
331.91 Leone Utterson 89
330.95 Nicolina Scorthorne 90
329.13 Vladamir Ellerey 91
328.39 Berti Gowlett 92
328.25 Daisey Lamball 93

NTILE()

Distributes the rows into a specified number of buckets (here, deciles), providing a way to understand distribution across different segments.

SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
NTILE(10) OVER (ORDER BY SUM(ol.price) DESC) AS "Decile"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
ORDER BY 
    "Sum of Orders per Customer" DESC;
Displaying records 1 - 10
Sum of Orders per Customer Full Name Decile
711.41 Rich Kirsz 1
663.41 Farand Tremmil 1
635.58 Renado Sherrington 1
622.31 Lynda Roseborough 1
611.59 Mick Sever 1
596.18 Alysa Crombleholme 1
583.81 Emylee Hubbert 1
574.20 La verne Figg 1
561.26 Zora Hurles 1
549.63 Penny Bovingdon 1

LEAD()

Provides access to the next value in the query result, useful for comparing each customer’s sum with the next customer’s sum.

SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
LEAD(SUM(ol.price)) OVER (ORDER BY SUM(ol.price) DESC) AS "Next Customer's Sum"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
ORDER BY 
    "Sum of Orders per Customer" DESC
LIMIT 5;
5 records
Sum of Orders per Customer Full Name Next Customer’s Sum
711.41 Rich Kirsz 663.41
663.41 Farand Tremmil 635.58
635.58 Renado Sherrington 622.31
622.31 Lynda Roseborough 611.59
611.59 Mick Sever 596.18

LAG()

The converse of LEAD(). LAG() provides access to the previous value in the query result, useful for comparing each customer’s sum with the next customer’s sum.

SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
LAG(SUM(ol.price)) OVER (ORDER BY SUM(ol.price) DESC) AS "Previous Customer's Sum"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
ORDER BY 
    "Sum of Orders per Customer" DESC
LIMIT 5;
5 records
Sum of Orders per Customer Full Name Previous Customer’s Sum
711.41 Rich Kirsz NA
663.41 Farand Tremmil 711.41
635.58 Renado Sherrington 663.41
622.31 Lynda Roseborough 635.58
611.59 Mick Sever 622.31

FIRST_VALUE()

Returns the sum of orders for the top customer in the ordered list as a seperate column, giving a reference point for the highest sum of orders.

SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
FIRST_VALUE(SUM(ol.price)) OVER (ORDER BY SUM(ol.price) DESC) AS "First Customer Sum"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
ORDER BY 
    "Sum of Orders per Customer" DESC
LIMIT 5;
5 records
Sum of Orders per Customer Full Name First Customer Sum
711.41 Rich Kirsz 711.41
663.41 Farand Tremmil 711.41
635.58 Renado Sherrington 711.41
622.31 Lynda Roseborough 711.41
611.59 Mick Sever 711.41

LAST_VALUE()

Returns the sum of orders for the bottom customer in the ordered list as a seperate column, giving a reference point for the lowest sum of orders.

SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
LAST_VALUE(SUM(ol.price)) OVER (ORDER BY SUM(ol.price) DESC) AS "Last Customer Sum"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
ORDER BY 
    "Sum of Orders per Customer" ASC
LIMIT 5;
5 records
Sum of Orders per Customer Full Name Last Customer Sum
0.25 Frans Teague 0.25
0.29 Robinson Renfrew 0.29
0.62 Cordelie Rickertsen 0.62
0.72 Kasey Jillitt 0.72
1.32 Alfons Blabie 1.32

SUM(),AVG(),MAX() or MIN() with OVER()

SUM() with OVER() calculates the total sum of all orders across all customers, providing a grand total within the context of each row.

SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
SUM(SUM(ol.price)) OVER () AS "Total Sum of All Customers"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
ORDER BY 
    "Sum of Orders per Customer" DESC
LIMIT 5;
5 records
Sum of Orders per Customer Full Name Total Sum of All Customers
711.41 Rich Kirsz 152763
663.41 Farand Tremmil 152763
635.58 Renado Sherrington 152763
622.31 Lynda Roseborough 152763
611.59 Mick Sever 152763

AVG() with OVER() calculates the total sum of all orders across all customers, providing a grand total within the context of each row.

SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
AVG(SUM(ol.price)) OVER () AS "Average Order Sum"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
ORDER BY 
    "Sum of Orders per Customer" DESC
LIMIT 5;
5 records
Sum of Orders per Customer Full Name Average Order Sum
711.41 Rich Kirsz 115.8173
663.41 Farand Tremmil 115.8173
635.58 Renado Sherrington 115.8173
622.31 Lynda Roseborough 115.8173
611.59 Mick Sever 115.8173

MAX() with OVER() returns the maximum sum of orders found in the dataset, providing a reference for the highest order sum among all customers.

SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
MAX(SUM(ol.price)) OVER () AS "Max. Sum of Orders"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
ORDER BY 
    "Sum of Orders per Customer" DESC
LIMIT 5;
5 records
Sum of Orders per Customer Full Name Max. Sum of Orders
711.41 Rich Kirsz 711.41
663.41 Farand Tremmil 711.41
635.58 Renado Sherrington 711.41
622.31 Lynda Roseborough 711.41
611.59 Mick Sever 711.41

MIN() with OVER() returns the minimum sum of orders found in the dataset, providing a reference for the highest order sum among all customers.

SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
MIN(SUM(ol.price)) OVER () AS "Min. Sum of Orders"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
ORDER BY 
    "Sum of Orders per Customer" DESC
LIMIT 5;
5 records
Sum of Orders per Customer Full Name Min. Sum of Orders
711.41 Rich Kirsz 0.25
663.41 Farand Tremmil 0.25
635.58 Renado Sherrington 0.25
622.31 Lynda Roseborough 0.25
611.59 Mick Sever 0.25

PARTITION BY(), CUME_DIST() and WIDTH_BUCKET

A partition is a subset of rows in the result set. When you use a window function with the PARTITION BY clause, the window function is applied to each partition separately. This is useful for performing calculations within each group of rows.


SELECT 
    a.city,
    SUM(ol.price) AS "Sum of Orders per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
    RANK() OVER (PARTITION BY a.city ORDER BY SUM(ol.price) DESC) AS "Rank"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
INNER JOIN
    address a ON co.dest_address_id = a.address_id
GROUP BY 
    a.city, c.first_name, c.last_name
ORDER BY 
    a.city, "Rank";
Displaying records 1 - 10
city Sum of Orders per Customer Full Name Rank
A Yun Pa 123.08 Joelie Menhci 1
A Yun Pa 85.49 Mariette Tulley 2
A Yun Pa 22.89 Cristen Tooby 3
Acacías 58.79 Pierette Duplain 1
Acacías 55.95 Grantley Clyburn 2
Acacías 46.78 Bennie Trigg 3
Afántou 115.89 Constance Raeburn 1
Afántou 33.64 Linea Gaither 2
Aghavnadzor 171.43 York Londsdale 1
Aghavnadzor 35.28 Patsy Hicks 2

CUME_DIST() calculates the cumulative distribution of a value within a partition. It returns the relative position of a value within the partition as a number between 0 and 1. It shows the proportion of rows that have a value less than or equal to the current row’s value.

SELECT 
    a.city,
    SUM(ol.price) AS "Sum of Orders per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
    CUME_DIST() OVER (PARTITION BY a.city ORDER BY SUM(ol.price) DESC) AS "Cumulative Distribution"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
INNER JOIN
    address a ON co.dest_address_id = a.address_id
WHERE
    a.city = 'A Yun Pa'  -- Filter for specific city here
GROUP BY 
    a.city, c.first_name, c.last_name
ORDER BY 
    a.city, "Sum of Orders per Customer" DESC;
3 records
city Sum of Orders per Customer Full Name Cumulative Distribution
A Yun Pa 123.08 Joelie Menhci 0.3333333
A Yun Pa 85.49 Mariette Tulley 0.6666667
A Yun Pa 22.89 Cristen Tooby 1.0000000

The above result means 100% of customers have a sum of orders less than or equal to Cristen’s, 2/3 less than or equal to Mariettes’s, and 1/3% less than or equal to Joelie’s.

The WIDTH_BUCKET function divides a continuous range of values into a specified number of equal-width buckets, returning the bucket number for each value. It is commonly used to categorize numeric data into intervals for analysis.

For example -

Below one can see that the salary column has been divided 10 times, from the values 0 to 150,000.

knitr::include_graphics("C:/Users/gam55/Downloads/width_bucket_screenshot.png")

SQLite does not support the WIDTH_BUCKET function directly but it can be simulated using a series of CASE statements.

SELECT 
    a.city,
    SUM(ol.price) AS "Sum of Orders per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name",
    CASE 
        WHEN SUM(ol.price) < 100 THEN 3
        WHEN SUM(ol.price) BETWEEN 100 AND 200 THEN 2
        WHEN SUM(ol.price) BETWEEN 201 AND 300 THEN 1
        ELSE 4
    END AS "Bucket"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
INNER JOIN
    address a ON co.dest_address_id = a.address_id
GROUP BY 
    a.city, c.first_name, c.last_name
ORDER BY 
    a.city, "Sum of Orders per Customer" DESC;
Displaying records 1 - 10
city Sum of Orders per Customer Full Name Bucket
A Yun Pa 123.08 Joelie Menhci 2
A Yun Pa 85.49 Mariette Tulley 3
A Yun Pa 22.89 Cristen Tooby 3
Acacías 58.79 Pierette Duplain 3
Acacías 55.95 Grantley Clyburn 3
Acacías 46.78 Bennie Trigg 3
Afántou 115.89 Constance Raeburn 2
Afántou 33.64 Linea Gaither 3
Aghavnadzor 171.43 York Londsdale 2
Aghavnadzor 35.28 Patsy Hicks 3

WITH RECURSIVE

SQLite supports the WITH RECURSIVE clause, which allows you to write recursive CTEs. This feature is useful for querying hierarchical or recursive data, such as workplace hierarchy or family trees.

Some of the explanations below have been lifted from https://learnsql.com/blog/sql-recursive-cte/

Recap on ‘common table expressions’, CTEs

“The CTE (common table expression), also known as the WITH clause, is an SQL feature that returns a temporary data set that can be used by another query. As it’s a temporary result, it’s not stored anywhere, but it still can be referenced like you would reference any other table.”

“A recursive CTE references itself. It returns the result subset, then it repeatedly (recursively) references itself, and stops when it returns all the results.”

The general syntax of a non-recursive CTE -

WITH cte_name AS (cte_query_definition)

 

SELECT *

FROM   cte_name;

“The first part of the syntax is the CTE. It begins with the keyword WITH. Then you give your CTE a name. After you follow that by the AS keyword, you can define CTE in the parentheses.”

“The second part of the syntax is a simple SELECT statement. It is written immediately after the recursive CTE, without any commas, semicolons, or similar marks. The CTE is used in another query just like any other table. This is exactly what the SELECT statement does.”

Recursive CTEs

A recursive CTE references itself. It returns the result subset, then it repeatedly (recursively) references itself, and stops when it returns all the results.

The syntax for a recursive CTE is not too different from that of a non-recursive CTE -

WITH RECURSIVE cte_name AS (

cte_query_definition (the anchor member)

 

UNION ALL

 

cte_query_definition (the recursive member)

)

 

 

SELECT *

FROM   cte_name;

“Again, at the beginning of your CTE is the WITH clause. However, if you want your CTE to be recursive, then after WITH you write the RECURSIVE keyword. Then it’s business as usual: AS is followed by the parentheses with the CTE query definition. This first query definition is called the anchor member.”

“To connect the anchor member with the recursive member, you need to use the UNION or UNION ALL command. The recursive member is, obviously, the recursive part of CTE that will reference the CTE itself.”

Finding Hierarchy in a Workplace Example

The ‘Gravity Bookstore’ used so far in this document doesn’t have any kind of hierarchical data some needs to be created -

CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    name TEXT,
    position TEXT,
    manager_id INTEGER
);
INSERT INTO employee (employee_id, name, position, manager_id) VALUES
(1, 'Alice', 'CEO', NULL),        -- Top-level management
(2, 'Bob', 'VP of Sales', 1),
(3, 'Carol', 'VP of Engineering', 1),
(4, 'Dave', 'Sales Manager', 2),
(5, 'Eve', 'Sales Executive', 4),
(6, 'Frank', 'Sales Executive', 4),
(7, 'Grace', 'Engineering Manager', 3),
(8, 'Heidi', 'Senior Engineer', 7),
(9, 'Ivan', 'Junior Engineer', 7),
(10, 'Judy', 'CTO', 1),
(11, 'Mallory', 'Sales Associate', 5),
(12, 'Oscar', 'Sales Associate', 5),
(13, 'Peggy', 'Engineering Intern', 9),
(14, 'Sybil', 'Product Manager', 3),
(15, 'Trent', 'DevOps Engineer', 7),
(16, 'Victor', 'Lead Architect', 10),
(17, 'Walter', 'Data Scientist', 16),
(18, 'Xavier', 'ML Engineer', 17),
(19, 'Yvonne', 'UX Designer', 14),
(20, 'Zara', 'Graphic Designer', 19);

select * from employee
Displaying records 1 - 10
employee_id name position manager_id
1 Alice CEO NA
2 Bob VP of Sales 1
3 Carol VP of Engineering 1
4 Dave Sales Manager 2
5 Eve Sales Executive 4
6 Frank Sales Executive 4
7 Grace Engineering Manager 3
8 Heidi Senior Engineer 7
9 Ivan Junior Engineer 7
10 Judy CTO 1
WITH RECURSIVE company_hierarchy AS (
  SELECT 
    employee_id,
    name,
    position,
    manager_id,
    0 AS hierarchy_level
  FROM 
    employee
  WHERE 
    manager_id IS NULL  
  
  UNION ALL
   
  SELECT 
    e.employee_id,
    e.name,
    e.position,
    e.manager_id,
    ch.hierarchy_level + 1
  FROM 
    employee e, 
    company_hierarchy ch
  WHERE 
    e.manager_id = ch.employee_id
)
 
SELECT 
  ch.name AS "Employee name",
  e.name AS "Boss name",
  ch.position AS "Employee Position",
  e.position AS "Boss Position",
  ch.hierarchy_level AS "Hierarchy Level"
FROM 
  company_hierarchy ch
LEFT JOIN 
  employee e ON ch.manager_id = e.employee_id
ORDER BY 
  ch.hierarchy_level, 
  ch.manager_id;
Displaying records 1 - 10
Employee name Boss name Employee Position Boss Position Hierarchy Level
Alice NA CEO NA 0
Bob Alice VP of Sales CEO 1
Carol Alice VP of Engineering CEO 1
Judy Alice CTO CEO 1
Dave Bob Sales Manager VP of Sales 2
Grace Carol Engineering Manager VP of Engineering 2
Sybil Carol Product Manager VP of Engineering 2
Victor Judy Lead Architect CTO 2
Eve Dave Sales Executive Sales Manager 3
Frank Dave Sales Executive Sales Manager 3

The above is a recursive query, so it starts with WITH RECURSIVE. The name of the CTE is company_hierarchy. After AS, the CTE definition is in the parentheses.

The first SELECT statement, the anchoring statement, selects all the employee table columns where the column manager_id is NULL. In short, it will select Alice, because only she has a NULL value in that column - starting the recursion from the top of the organizational structure. There’s also a column hierarchy_level with the value of 0. That means the head of the company’s level is 0 – they’re on top of the hierarchy.

UNION ALL is used to connect this SELECT statement with the second one, the recursive member. In the recursive member, all the columns from the table employees ares selected and the CTE company_hierarchy where the column boss_id is equal to the column id. Notice the part hierarchy_level + 1.This means that with every recursion, the CTE will add 1 to the previous hierarchy level, and it will do that until it reaches the end of the hierarchy. Also note that the CTE is treated as any other table. To finish defining the CTE, the brackets are closed.

Finally, there’s a third SELECT statement, outside of the CTE. It selects the columns that will show employees, their bosses’ names, and the hierarchy level. Data is taken from the CTE and the table employees. The CTE and employees table are joined with a LEFT JOIN, since we want all the data from the CTE – including Alice, who has the NULL value in the column boss_id.

In a very simplistic way -

“Recursive CTEs are used to handle and query data that refers to itself, like finding all employees who report to a manager in a company hierarchy.”